ABSTRACT


This survey covers some very recent applications of data mining techniques in the field of agriculture, an emerging research field that is under constant
development. We first present two applications in detail: the problem of discovering problematic wine
fermentations at the early stages of the process, and the problem of predicting yield production from sensor data. We then briefly describe other
problems in the field for which we found very recent contributions in the scientific literature.


KEYWORDS: Clustering, Fermentations, Cross-validation, k-means algorithm, Bi-cluster. 


1. INTRODUCTION 

Two years ago, one of the authors of this survey co-authored a book named “Data
Mining in Agriculture” [15]. The book gives a wide overview of recent data mining
techniques, and it also presents several applications in the field of agriculture,
as well as in other related fields, such as biology. In this survey, we give particular
attention to two recent works in which the authors of this paper are directly involved,
and we then briefly mention some other applications that seemed to us the most
interesting to report. The survey is organized as follows. In Section 2, we
present an analysis performed on datasets of wine fermentations with the
aim of predicting problematic fermentations at the early stages of the process. In
Section 3, we consider the problem of predicting yield production, in which
state-of-the-art GPS technologies are employed in connection with site-specific
and sensor-based treatments of crops. Section 4 discusses data mining techniques
for this prediction problem, Section 5 mentions other recent works, and Section 6
concludes the survey.


2. STUDYING WINE FERMENTATIONS 

Wine is widely produced all over the world. There exist different types of wine,
which depend on different factors, and especially on the origin of the grapes that
are employed in the production. A common point for all wines is the fermentation
process, in which the sugar contained in the grapes is transformed into alcohol.
This is a very delicate process. When wine is produced industrially, large
quantities may get spoiled because of a problematic fermentation process,
causing losses to the industry. To overcome this issue, a prediction of the
problematic wine fermentations can be attempted, so that an enologist can
intervene in the process in time to guarantee a good fermentation. In order to
monitor wine fermentation processes, metabolites such as glucose, fructose,
organic acids, glycerol and ethanol can be measured. However, analyses are
usually limited to data that are obtained within the first 3 days of fermentation,
so that a possible problematic fermentation can be detected at the beginning of
the process. Fermentations can be divided into 3 classes: the first class contains
normal fermentations, while the second and the third contain the problematic
ones. In particular, the second class contains fermentations which are slow, in the
sense that they can still bring the wine to the end of the production, although more
slowly than normal ones. Finally, the third class contains stuck fermentations.


Given a certain time t during the fermentation process, measurements taken at
time t can be grouped together in order to form clusters. A clustering technique
may indeed define clusters that are related to normal or problematic fermentations
by exploiting the inherent characteristics of the data. Naturally, due in large
part to the time-variable nature of the fermentation process, a fermentation can be
assigned to different clusters for different values of t. In these studies, the k-means
algorithm [12] was employed for finding clusters of data points, where the number of
clusters k was arbitrarily set to 5.
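
The clustering step described above can be sketched in a few lines of Python. This is only a minimal illustration of the standard k-means (Lloyd) iteration with k = 5, not the implementation used in the cited studies. The (glucose, ethanol) values, the function names and the random initialization are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    """Lloyd's iteration: alternate assignment and centre update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centres: k distinct records
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no change: the clustering is stable
            break
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = mean_point(members)
    return labels, centers

# Hypothetical (glucose, ethanol) measurements, one per fermentation,
# all taken at the same time t. The values are invented for illustration.
measurements = [
    (95.0, 2.0), (93.5, 2.4), (80.2, 9.1), (78.9, 9.8), (60.3, 18.5),
    (61.0, 17.9), (99.1, 0.8), (98.4, 1.1), (70.5, 13.2), (71.8, 12.6),
]

labels, centers = k_means(measurements, k=5)
for fermentation, label in enumerate(labels):
    print(f"fermentation {fermentation}: cluster {label}")
```

Repeating the same call on the measurements taken at another time t gives, in general, a different assignment, which is why a fermentation may move between clusters over time.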


Samples are organized on the columns of a matrix A, and therefore measurements
of the same compound taken from different fermentations, but at the same
time t, can be found on the rows of A. For each compound and each time t, there is
a specific feature in A. A bi-cluster is a submatrix defined by a subset of samples
and a subset of features contained in A. As a consequence, a bi-clustering of A is a
partition of A into disjoint bi-clusters, whose rows and columns cover the ones in A,
and therefore it gives a relation between samples and features in A [3]. In order to
find a consistent bi-clustering, a fractional optimization problem with binary
variables can be defined, whose aim is to select the features that are actually relevant
for the representation of the samples. This optimization problem is NP-hard [14],
so a heuristic algorithm [13] can be used for its solution. This bi-clustering
technique was able to find some interesting information regarding the compounds
that are monitored during the fermentation process [15]. Figure 1 shows
the basic data mining process.
Fig. 1 Basic Data mining process
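
The text does not spell out when a bi-clustering is considered consistent. The sketch below assumes the centroid-based notion that is common in the bi-clustering literature. Each feature is assigned to the sample class in which its mean value is highest, and each sample is then assigned to the feature class in which its mean value is highest. The bi-clustering is taken to be consistent when the second step reproduces the original sample classes. The matrices, the class labels and the function names are hypothetical, and the code only checks consistency. It does not solve the feature selection problem.

```python
def class_means(values, classes, n_classes):
    """Mean of `values` within each class (-inf for an empty class)."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for value, cls in zip(values, classes):
        sums[cls] += value
        counts[cls] += 1
    return [s / n if n else float("-inf") for s, n in zip(sums, counts)]

def argmax(xs):
    """Index of the largest entry (the first one in case of ties)."""
    return max(range(len(xs)), key=lambda r: xs[r])

def classify_features(A, sample_classes, n_classes):
    """Assign each row (feature) to the sample class with the highest mean."""
    return [argmax(class_means(row, sample_classes, n_classes)) for row in A]

def classify_samples(A, feature_classes, n_classes):
    """Assign each column (sample) to the feature class with the highest mean."""
    return [
        argmax(class_means(column, feature_classes, n_classes))
        for column in zip(*A)
    ]

def is_consistent(A, sample_classes, n_classes):
    """Return (consistent?, feature classes) for a given sample classification."""
    feature_classes = classify_features(A, sample_classes, n_classes)
    rebuilt = classify_samples(A, feature_classes, n_classes)
    return rebuilt == list(sample_classes), feature_classes

# Rows are features, columns are samples, as in the matrix A of the text.
# Hypothetical classes: 0 = normal, 1 = problematic.
sample_classes = [0, 0, 1, 1]
A = [
    [9.0, 8.0, 1.0, 2.0],
    [7.0, 9.0, 2.0, 1.0],
    [1.0, 2.0, 8.0, 9.0],
    [2.0, 1.0, 9.0, 7.0],
]
B = [
    [5.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 2.0, 2.0],
]

print(is_consistent(A, sample_classes, 2))
print(is_consistent(B, sample_classes, 2))
```

Under this assumption, A is consistent while B is not, because the second sample of B is pulled towards the wrong class by the second feature. Removing such features is, in spirit, what the feature selection problem aims at.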


3. PREDICTING YIELD PRODUCTION 

Yield prediction is a very important agricultural problem. Any farmer would
like to know, as soon as possible, how much yield to expect. Attempts to
solve this problem date back to the time when the first farmers began to work the
soil in order to make a profit. For years, yield predictions have been performed by
considering the farmer's experience on particular fields and crops. However, this
knowledge can also be obtained by exploiting information given by modern
technologies, such as GPS. A multitude of sensor data can nowadays be collected
relatively easily, so that farmers harvest not only crops but also ever-growing
amounts of data. These data are fine-scale, often highly correlated and carry
spatial information. Figure 2 shows the final grouping of the wine fermentations
discussed in Section 2.
Fig. 2 Classification of wine fermentations using the k-means algorithm
with k = 5 and by grouping the clusters in 13 groups


4. RECENT DEVELOPMENTS IN DATA MINING AND AGRICULTURE

The problem of predicting yield production can be addressed by employing data
mining techniques. Suppose that sensor data are available for some time back into
the past, together with the corresponding recorded yield productions. All this
information forms a training set which can be exploited to learn how to predict
future yield productions once new sensor data are available. There are different
data mining techniques that can be used for this purpose. In general, when the
k-fold cross-validation technique is used, the original dataset is divided, for each
fold, into three parts: a training set, a validation set and a test set. Setting k equal
to 10 or 20 is generally considered appropriate to reduce bias. The regression
model is trained on the training set until the prediction error on the validation
set starts to rise. Once this happens, the training process is stopped and the error
on the test set is reported for this fold. In spatial data, however, due to spatial
autocorrelation, almost identical data records may end up in the training, validation
and test sets, so that the reported error may be too optimistic. The following table
shows data mining methodologies and their use in the agriculture domain.


Table 1 Data mining methodologies and their use in the agriculture domain

Methodology             | Applications
K-means                 | Forecasts of pollution in the atmosphere; classifying soil in combination with GPS
k-nearest Neighbor      | Simulating daily precipitations and other weather variables
Support Vector Machine  | Analysis of different possible changes of the weather scenario
Decision Tree Analysis  | Prediction of soil depth
Unsupervised Clustering | Generating clusters and determining any existence of patterns
WEKA Tool               | Classification system for sorting and grading mushrooms
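
The cross-validation scheme described earlier in this section, and the problem caused by spatial autocorrelation, can be illustrated as follows. The sketch builds training, validation and test sets by rotating over k folds. It also shows one possible way of keeping neighbouring records together, namely assigning whole grid cells to a fold. The grid-based blocking, the coordinates, the cell size and the function names are assumptions made for this example and are not taken from the works surveyed here.

```python
import random

def random_folds(n_records, k, seed=0):
    """Assign records to k folds at random, with balanced fold sizes."""
    rng = random.Random(seed)
    order = list(range(n_records))
    rng.shuffle(order)
    folds = [0] * n_records
    for position, record in enumerate(order):
        folds[record] = position % k
    return folds

def cell_of(point, cell_size):
    """Grid cell containing a point given as (x, y)."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def spatial_block_folds(coords, cell_size, k):
    """Records in the same grid cell always share a fold."""
    cells = sorted({cell_of(p, cell_size) for p in coords})
    cell_to_fold = {cell: i % k for i, cell in enumerate(cells)}
    return [cell_to_fold[cell_of(p, cell_size)] for p in coords]

def splits(folds, k):
    """Yield (training, validation, test) index lists, one triple per fold."""
    for test_fold in range(k):
        val_fold = (test_fold + 1) % k
        train = [i for i, f in enumerate(folds) if f not in (test_fold, val_fold)]
        val = [i for i, f in enumerate(folds) if f == val_fold]
        test = [i for i, f in enumerate(folds) if f == test_fold]
        yield train, val, test

# Hypothetical field: one sensor record per point of a 30 x 30 grid.
coords = [(float(x), float(y)) for x in range(30) for y in range(30)]
k = 10
folds = spatial_block_folds(coords, cell_size=3.0, k=k)
all_splits = list(splits(folds, k))

train, val, test = all_splits[0]
print(len(train), len(val), len(test))
```

With `random_folds`, two adjacent and almost identical records can easily fall on both sides of a split. With `spatial_block_folds`, records of the same cell are never separated, although records on either side of a cell border can still be close to each other.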


5. OTHER RECENT WORKS 

We mention in this section some other recent interesting works in the field of data
mining and agriculture. We begin with some other works related to the production
of wine, which has been the focus of Section 2, where data mining
approaches are employed for the prediction of problematic wine fermentations.
The main aim of that work is to discover in advance fermentations that are going
to be slow or stuck, and to intervene in the process in order to guarantee a
good fermentation. Other recent studies concern the taste of the wine that is
produced. In [4], for example, data mining techniques are employed in order to
predict the taste of wine. This is done by creating a training set in which a
classification of each sample (wine) is assigned by traditional wine tasters, who
generally analyze some subjective parameters such as color, foam, flavor and savour
of the wine. Once the classification task has been learned by exploiting the training
set, data mining techniques are then supposed to substitute traditional wine
tasters. Wine tastes are also analyzed in relation to seasonal climate effects.
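
The idea of learning a classification from samples labeled by wine tasters can be sketched with a k-nearest neighbor classifier, one of the methodologies listed in Table 1. This is not the technique used in [4]. The two features, their values and the labels below are invented for illustration only.

```python
from collections import Counter

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(train_features, train_labels, query, k=3):
    """Majority label among the k training samples closest to `query`."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: squared_distance(train_features[i], query),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: each wine is described by two measured
# features and carries the class assigned by the wine tasters.
train_features = [
    (3.2, 12.5), (3.3, 12.8), (3.1, 13.0),
    (3.9, 9.5), (4.0, 9.8), (3.8, 10.0),
]
train_labels = ["good", "good", "good", "poor", "poor", "poor"]

print(knn_predict(train_features, train_labels, (3.25, 12.6)))
print(knn_predict(train_features, train_labels, (3.9, 9.7)))
```

In practice the features would need to be put on comparable scales before distances are computed, since otherwise the feature with the largest numerical range dominates the vote.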





6. CONCLUSIONS 

This review presents a quick update on the state of the art in the field
of data mining and agriculture. We mainly focused our attention on two particular
problems. The first one is the problem of identifying problematic wine
fermentations at the early stages of the process. A data mining approach to this
problem has been discussed where the k-means algorithm was used. We described the
recent developments on this problem, and in particular new studies where
bi-clustering techniques are employed for identifying the compounds of wine that
are most likely the cause of problematic fermentations. The second problem we
considered is the one of predicting yield production. The first approaches to this
problem were based on standard data mining techniques, such as support vector
regression and artificial neural networks. Recent works showed how to improve
the quality of the predictions by taking spatial autocorrelation into account.